
[WIP] Use int_scaled_matmul with int8_dynamic_activation_int8_weight(act_mapping_type=MappingType.ASYMMETRIC) #1402

Closed
wants to merge 7 commits

Conversation

@sanchitintel (Contributor) commented Dec 11, 2024

Feature

Use int_scaled_matmul with asymmetrically quantized int8 activations & symmetrically quantized int8 weights by applying a compensation term for the activation zero points.

Motivation

Currently, optimizing GEMMs that use asymmetrically quantized int8 activations & symmetrically quantized int8 weights with the int8_dynamic_activation_int8_weight API poses a problem: torchao does not use torch._int_mm for this case, so with frozen weights (inference, with the Inductor freezing config enabled), the frozen int8 weights are folded back into FP32 weights (as aten.mm's second argument) during Inductor's constant-folding passes while freezing.

For the sym act, sym wgt case, torch._int_mm is already used, which makes it easier to leverage frozen int8 weights with Inductor pattern-matching & Inductor max-autotune mode.

This PR does something similar for the asym act, sym wgt case by using torch._int_mm with int8 activations & weights and applying a compensation term corresponding to the activation's zero points.

This change makes it possible to match this GEMM's pattern with Inductor pattern-matching & use Inductor max-autotune to fuse the whole GEMM (support for that would be added in a PyTorch PR).
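
For reference, here is a minimal eager-mode sketch of the compensation idea. The function and argument names are purely illustrative (not the torchao API); the actual implementation routes the int8 matmul through int_scaled_matmul.

```python
import torch

def asym_act_sym_wgt_int8_mm(x_q, x_scale, x_zp, w_q, w_scale):
    """Illustrative sketch of zero-point compensation (not the torchao API).

    x_q:     [M, K] int8 activations, asymmetrically quantized per row
    x_scale: [M, 1] fp32 activation scales
    x_zp:    [M, 1] int32 activation zero points
    w_q:     [K, N] int8 weights, symmetrically quantized per column
    w_scale: [N]    fp32 weight scales
    """
    # x ~= x_scale * (x_q - x_zp) and w ~= w_scale * w_q, hence
    # x @ w ~= x_scale * w_scale * (x_q @ w_q - x_zp * sum_k w_q[k, :])
    acc = torch._int_mm(x_q, w_q)                                  # int32, [M, N]
    comp = x_zp * w_q.sum(dim=0, keepdim=True, dtype=torch.int32)  # [M, N] via broadcast
    return (acc - comp).to(torch.float32) * x_scale * w_scale
```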

TODO

  • Report whether there is any eager-mode slowdown with this approach
  • Create a PyTorch PR & report the inference speedup


pytorch-bot bot commented Dec 11, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1402


@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 11, 2024
@@ -888,14 +888,19 @@ def _choose_qparams_affine(
                 "preserve_zero == False is not supported for symmetric quantization"
             )
         if (
-            zero_point_domain is not None
+            zero_point_domain != ZeroPointDomain.NONE.name
+            and zero_point_domain != None
Contributor

I feel we can probably remove support for None since it's the same as ZeroPointDomain.NONE.name

@sanchitintel (Contributor, Author) commented Dec 13, 2024

Thanks again for reviewing!

Some other places in the codebase are also using both ZeroPointDomain.NONE.name and None separately:

), f"dequantiztion with no zero point domain is only supported with FP8 types, got {input.dtype}"

I unified these two cases into one in the latest commit, but I'm not sure whether changes to the __tensor_unflatten__ & __tensor_flatten__ methods of some classes may also be required elsewhere in the codebase, to ensure they can deal with a None zero point when TorchDynamo is used. I'll run CUDA-only UTs at my end tomorrow morning to verify.

EDIT: Haven't gotten access to an Nvidia GPU until now
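
For concreteness, a hypothetical fragment of what such a change could look like; the class and attribute names below are made up, and this is not the actual torchao tensor subclass (constructor and other methods omitted).

```python
import torch

# Hypothetical sketch: flatten/unflatten an optional zero_point so that
# torch.compile / TorchDynamo can re-trace a subclass holding zero_point=None.
class Int8QuantizedTensorSketch(torch.Tensor):
    # __new__ / __init__ omitted for brevity in this sketch.

    def __tensor_flatten__(self):
        # Only report zero_point as an inner tensor when it actually exists.
        tensor_data = ["int_data", "scale"]
        if self.zero_point is not None:
            tensor_data.append("zero_point")
        return tensor_data, [self.dtype]

    @classmethod
    def __tensor_unflatten__(cls, tensor_data_dict, tensor_attributes, outer_size, outer_stride):
        (dtype,) = tensor_attributes
        return cls(
            tensor_data_dict["int_data"],
            tensor_data_dict["scale"],
            tensor_data_dict.get("zero_point", None),  # tolerate a missing zero_point
            dtype=dtype,
        )
```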

         )
+        if zero_point_domain == ZeroPointDomain.NONE.name:
Contributor

Thanks for the fix; it looks like this wasn't tested before. Can you add a test for the new code path?

Also, this op is becoming too complicated; we want to split it up.

Contributor Author

> Can you add a test for the new code path?

This case is being tested in a UT I added in test/quantization/test_quant_primitives.py

> Also, this op is becoming too complicated; we want to split it up.

Please advise if you're referring to splitting _choose_qparams_affine.
If so, I could split it up into smaller methods. Thanks!

@jerryzh168 (Contributor) commented Dec 13, 2024

Yeah, I meant splitting choose_qparams_affine/quantize_affine/dequantize: not into smaller methods, but into different variations, to reduce the complexity of the most common path (and remove these if/else checks). This includes removing the preserve_zero and zero_point_domain args and just having different variations of choose_qparams_affine/quantize_affine/dequantize. This should be done separately, though, since it will be a large change.
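
(Purely for illustration, the kind of split being described might look roughly like the following; these signatures are hypothetical and do not exist in torchao.)

```python
import torch
from typing import Tuple

# Hypothetical signatures sketching the proposed split; none of these exist in torchao.
def choose_qparams_affine_symmetric(
    input: torch.Tensor, block_size: Tuple[int, ...], target_dtype: torch.dtype
) -> torch.Tensor:
    """Symmetric variant: returns only a scale; no zero_point_domain / preserve_zero args."""
    ...

def choose_qparams_affine_asymmetric(
    input: torch.Tensor, block_size: Tuple[int, ...], target_dtype: torch.dtype
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Asymmetric variant: returns (scale, zero_point) with the zero point in the integer domain."""
    ...
```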

@sanchitintel (Contributor, Author)

Closing in favor of #1556, which fixes the ZeroPointDomain.NONE implementation. Thanks!
